
The Forecaster’s Toolbox

Lecture 5

Before building a complex model, what simple benchmarks should every forecaster know?

Four benchmark forecasting methods

Mean method — forecast all future values with the historical mean.
  • ŷT+h = ȳ = (1/T) ∑t=1..T yt
  • Best when the series fluctuates randomly around a constant level.
Naïve method — use the most recent observation as the forecast.
  • ŷT+h = yT — optimal for a random walk.
Seasonal naïve — use the most recent value from the same season.
  • ŷT+h = yT+h−m⌈h/m⌉, where m is the seasonal period (m = 12 for monthly data, 4 for quarterly).
  • Strong benchmark for highly seasonal series (retail, tourism, energy).
Drift method — extend the trend from the first to the last observation.
  • ŷT+h = yT + h · (yT − y1) / (T − 1)
A model that cannot beat a naïve benchmark is not worth using.
Benchmarks serve two purposes:
1. Sanity check. If your ARIMA model is worse than seasonal naïve, something is wrong — overfit training data, wrong differencing, or an implementation error.
2. Communication. Stakeholders understand “we beat last year’s same month by 5% in accuracy” far more readily than an abstract loss function value.
In fpp3: model(Mean = MEAN(y), Naive = NAIVE(y), SNaive = SNAIVE(y), Drift = RW(y ~ drift())) fits all four simultaneously.
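A minimal runnable sketch, assuming a tsibble named train with a measured variable y (both names are placeholders for your own data):

  library(fpp3)  # loads tsibble, fable, feasts, ggplot2, and friends

  # Fit all four benchmarks in one call; the result is a mable.
  fit <- train |>
    model(
      Mean   = MEAN(y),
      Naive  = NAIVE(y),
      SNaive = SNAIVE(y),
      Drift  = RW(y ~ drift())
    )

  # Forecast 12 periods ahead and plot all four against the data.
  fit |> forecast(h = 12) |> autoplot(train)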

How do you know if a forecasting model has captured all the structure in the data?

Residual diagnostics

Residuals are the difference between observed and fitted values.
  • et = yt − ŷt (in-sample; not the same as forecast errors on test data).
A well-specified model has residuals that are:
  • Uncorrelated — ACF of residuals shows no significant bars (white noise).
  • Zero mean — no systematic bias; forecasts are neither consistently high nor low.
Useful but not required: normality and constant variance.
  • Normality enables exact prediction intervals; without it, intervals are approximate.
  • Non-constant variance (heteroskedasticity) can be addressed with a Box-Cox transformation.
In fpp3: gg_tsresiduals(fit) plots residuals, ACF, and histogram in one call.
Portmanteau tests check residual autocorrelation across many lags at once.
Rather than inspecting individual ACF bars, portmanteau tests combine evidence across l lags into a single test statistic.
Box-Pierce statistic: Q = T ∑k=1..l rk², where rk is the autocorrelation of the residuals at lag k.
Ljung-Box statistic (preferred for small samples): Q* = T(T+2) ∑k=1..l rk² / (T−k)
Both are approximately χ²(l) under the null of white noise. A small p-value rejects white noise — the model is missing structure. In fpp3: augment(fit) |> features(.innov, ljung_box, lag = 10).
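A sketch of both diagnostics, assuming a mable fit holding a single model (e.g., the SNaive fit from above):

  # Residual time plot, residual ACF, and histogram in one figure.
  fit |> select(SNaive) |> gg_tsresiduals()

  # Ljung-Box test on the innovation residuals over the first 10 lags;
  # a small p-value signals leftover autocorrelation.
  fit |>
    augment() |>
    features(.innov, ljung_box, lag = 10)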

How do we measure forecast accuracy fairly?

Accuracy must always be measured on data the model has never seen.
The standard split: fit the model on a training set, evaluate forecasts on a test set (held out at the end of the series).
Never evaluate a model on its own training data. A sufficiently flexible model can always fit the training data closely; that is overfitting, not skill. In-sample fit statistics (R², AIC) measure how well the model fits past data, not how well it forecasts.
The test set should be at least as long as the forecast horizon you care about. A model that looks great at horizon 1 may be terrible at horizon 12.
In fpp3: filter(year(date) <= 2019) creates the training set; filter(year(date) > 2019) creates the test set.
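A sketch, assuming a tsibble y_data indexed by date with a measured variable y (placeholder names):

  # Hold out everything after 2019 as the test set.
  train <- y_data |> filter(year(date) <= 2019)

  # Fit on the training set only, forecast across the test period,
  # then score against the full series (accuracy() finds the held-out actuals).
  train |>
    model(SNaive = SNAIVE(y)) |>
    forecast(h = "2 years") |>
    accuracy(y_data)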
Forecast errors on the test set measure true predictive performance.
Measure | Formula              | Interpretation
MAE     | mean(|et|)           | Average absolute miss; same units as the series.
RMSE    | √mean(et²)           | Penalizes large errors more heavily; same units as the series.
MAPE    | mean(|et/yt|) × 100  | Scale-free percentage; undefined when yt = 0.
MASE    | MAE / MAEnaïve       | Scaled by the naïve benchmark; < 1 means better than naïve.
MASE is the recommended default for comparing forecasts across series of different scales.
The Mean Absolute Scaled Error scales the MAE by the in-sample MAE of the naïve (or seasonal naïve) method:
MASE = MAE / ((1/(T−1)) ∑t=2..T |yt − yt−1|)
A MASE of 0.80 means the model’s average error is 80% of what a naïve method would produce — a 20% improvement. MASE > 1 means the model is worse than naïve.
MASE works even when the series contains zeros (unlike MAPE) and is meaningful across series with different units and scales.
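A base-R sketch of the computation above, with all names hypothetical:

  # MASE for a non-seasonal series: scale the test-set MAE by the
  # in-sample MAE of the one-step naive method.
  mase <- function(actual, forecast, train) {
    scale <- mean(abs(diff(train)))       # (1/(T-1)) * sum of |y_t - y_{t-1}|
    mean(abs(actual - forecast)) / scale
  }

  mase(actual = c(105, 110), forecast = c(100, 108), train = c(90, 95, 100))
  # test MAE = 3.5, naive scale = 5, so MASE = 0.7: better than naive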

What if a single train/test split gives an unreliable accuracy estimate?

Time series cross-validation evaluates accuracy across many rolling origins.
Also called evaluation on a rolling forecast origin. Instead of one train/test split, we repeatedly:
  • Train on observations 1…t.
  • Forecast observations t+1 … t+h.
  • Advance the origin by one period and repeat.
The result is many forecast errors across many origins. Averaging gives a more stable, less noisy accuracy estimate than a single split.
In fpp3: stretch_tsibble(.init = 48, .step = 1) creates the rolling origins; then model() |> forecast(h = 12) |> accuracy() aggregates the errors.
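A sketch with the same placeholder names, where .init = 48 sets the first training window to 48 observations:

  # Build the rolling origins: training sets of length 48, 49, 50, ...
  y_data |>
    stretch_tsibble(.init = 48, .step = 1) |>
    model(SNaive = SNAIVE(y)) |>        # refit at every origin
    forecast(h = 12) |>
    accuracy(y_data)                    # errors averaged across all origins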
Time series CV differs from standard k-fold CV in one critical way.
In standard k-fold CV for cross-sectional data, observations are randomly assigned to folds. The model trains on any 80% of data and tests on the remaining 20%, in any order.
This is invalid for time series. Future values cannot be used to predict the past. Training data must always precede the test period.
Time series CV respects temporal order: each training window only uses data from before the forecast origin. This prevents data leakage — accidentally using future information to fit the model.

Producing and interpreting prediction intervals

A prediction interval covers the true future value with a stated probability.
  • An 80% PI contains the true value in 80% of repeated forecasting instances.
  • A 95% PI is wider; an 80% PI is narrower. Both are valid — the choice depends on the decision context.
Under normality, a c% PI is:
  • ŷT+h ± z · σ̂h, where z is the standard normal quantile at probability (1 + c/100)/2.
  • For 80%: z ≈ 1.28; for 95%: z ≈ 1.96.
Intervals widen as the horizon increases.
  • Uncertainty compounds. The 12-step-ahead PI is never narrower than the 1-step-ahead PI for the same model.
Bootstrap prediction intervals don’t require normality.
  • In fpp3: forecast(fit, bootstrap = TRUE, times = 1000).
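A sketch, continuing with a single-model mable fit:

  # Bootstrap the forecast distribution rather than assuming normal errors.
  fc <- fit |> forecast(h = 12, bootstrap = TRUE, times = 1000)

  # Extract the 80% and 95% intervals as explicit lower/upper bounds.
  fc |> hilo(level = c(80, 95))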
Point forecast accuracy and distributional accuracy are separate things.
A model can produce accurate point forecasts but poorly calibrated prediction intervals — or vice versa.
Calibration: an 80% PI should contain the true value 80% of the time, not 60% or 95%. Poor calibration means the stated confidence is misleading.
Winkler score and CRPS (Continuous Ranked Probability Score) measure the quality of the full predictive distribution, penalizing both inaccurate point forecasts and miscalibrated uncertainty.
In fpp3: accuracy(forecasts, test_data, measures = list(CRPS = CRPS)).
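fabletools also ships ready-made measure sets; a sketch, assuming the fable fc and tsibble y_data from above:

  # CRPS and percentile score over the full predictive distribution.
  fc |> accuracy(y_data, measures = distribution_accuracy_measures)

  # Winkler score for the 80% interval specifically.
  fc |> accuracy(y_data, measures = interval_accuracy_measures, level = 80)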
Combining forecasts from multiple models often outperforms any single model.
This is one of the most robust findings in the forecasting literature, confirmed across every major forecasting competition (M1, M3, M4, M5).
The simplest combination: an equal-weight average of k models’ forecasts. ŷcombo = (1/k) ∑i=1..k ŷi
Why it works: different models make different errors. Averaging cancels out idiosyncratic mistakes and is less sensitive to model misspecification than relying on any single model.
In fpp3: model(m1 = ..., m2 = ..., m3 = ...) |> mutate(combo = (m1 + m2 + m3) / 3).
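A sketch of an equal-weight combination; the three model choices are purely illustrative:

  # Average three candidate models with equal weights inside the mable.
  fit <- train |>
    model(
      ets    = ETS(y),
      arima  = ARIMA(y),
      snaive = SNAIVE(y)
    ) |>
    mutate(combo = (ets + arima + snaive) / 3)

  fit |> forecast(h = 12) |> accuracy(y_data)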

The complete fpp3 forecasting workflow

1. model() — fit one or more models to a tsibble.
  • Returns a mable (model table) — one row per series, one column per model.
2. forecast(h = ...) — generate point forecasts and prediction intervals.
  • Returns a fable (forecast table) with distributional forecasts.
3. accuracy() — compute error measures on test data.
  • Compare MAE, RMSE, MASE, MAPE across models side by side.
4. autoplot() — visualize forecasts with shaded prediction intervals.
  • Dark shading = 80% PI; light shading = 95% PI.
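The four steps chained together, with the same placeholder names used throughout:

  fit <- train |>
    model(SNaive = SNAIVE(y), Drift = RW(y ~ drift()))  # 1. mable
  fc <- fit |> forecast(h = 12)                          # 2. fable
  fc |> accuracy(y_data)                                 # 3. test-set error measures
  fc |> autoplot(y_data, level = c(80, 95))              # 4. forecasts with shaded PIs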
Choose error measures deliberately, not by default.
Use MAE or RMSE when all series share the same units and scale.
Use MASE when comparing across series of different scales, or when zeros in the series make MAPE undefined.
Use MAPE only when percentage errors are natural to the business (e.g., “we were off by 8%”) and the series never touches zero.
Use CRPS or Winkler score when distributional accuracy (not just point accuracy) matters for decisions — e.g., inventory stocking levels set at a specific quantile of demand.